Introduction
In this R markdown, you will be able to find the solution for HW-1 of SDS 2023/2024. In the HTML format, you will be able to read our comments and visualize our plots. Yet, if you wish to investigate the code that generated them, or the code that was used “behind the scenes”, you should refer the .Rmd file. The file includes also interactive plots. In the event that such plots are not visible in the HTML format, please refer to this link here.
Exercise 1
Setup
There is a class of \(n = 150\) students, where \(k\) are liars and \(n-k\) are honest. We are interested in classifying each person in one of these two groups (liar and honest). To do this we will let them toss a coin. Honest students toss a fair coin, i.e. a coin with probability \(p\) of head equal to \(0.5\). Liar students, instead, will toss a coin with probability \(q\) of head, where \(0.50 < q \leq 1\). So there will be \(n-k\) Bernoulli random variables with probability of head equal to \(p\) (we call them \(H\)) and \(k\) Bernoulli random variables with probability of head equal to \(q\) (let’s call them \(L\)).
Thus the number of students \(n\) and the probability of head for honest students \(p\) are fixed.
Our task is to choose a number \(N\) of tosses so each student in the class will toss their coin \(N\) times and then, based on the results of these tosses, decide if the student is honest or not (thus \(q\) is known, but something we can’t affect in any way, while we can choose \(N\)). For this end, we need to define a decision rule. Ideally, we would like this rule to minimize the classification errors. Indeed, we could claim that the student is a liar when he is actually honest (False Positive) or that the student is honest when he is actually a liar (False Negative).
We start by trying to understand how the distribution of Heads within the two sub-populations of honest and liar students changes in relation to \(q\) and \(N\). For this purpose we fix \(k\), the number of liars, and then sample from two Binomial distributions with \(n-k\) and \(k\) observations, \(N\) trials, and \(p\) and \(q\) probability of success respectively. We can do this since the outcomes of the tosses are independent Bernoulli r.v. and the total number of heads is the sum of these outcomes. Indeed, it can be shown that \(\sum_{i=1}^nBer(p) \sim Binom(n, p)\) (where \(n\) and \(p\) are generic and do not refer to any of the parameters stated above).
We are now going to plot the two distributions when \(q=0.6\), after fixing \(k=75\).
We notice that they overlap a lot and as a consequence its difficult to distinguish them. We can now imagine to have a classification threshold at \(x=28\) (we are going to explain this better in the next lines). We can see that the probabilities of false positives and false negatives would be very high (the area in which they overlap after/before the threshold).
Then we change \(q\) and plot the two distributions again.
We also want to see how the two distributions change as we increase the number of tosses (with a fixed \(q\)). We choose \(N_1=25\) and \(N_2 = 75\) and plot the distributions.
As expected we see that, as \(N\) and \(q\) grow, these two sub populations become more and more different and thus distinguishable and separable. As a result, we also expect the probability of false positives, which we call \(\alpha\) and the probability of false negatives, called \(\beta\) to decrease (as \(N\) and \(q\) increase). To compute these probabilities we can start by choosing a simple decision rule like *“Classify a student as liar if the number of Heads is greater or equal to* \(70\%\) of the number of tosses (\(N\))”.
Next, to compute \(\alpha\) we use the CDF of the binomial distribution of the tosses of the honest students to get the probability that the number of heads is \(\geq\) than our threshold (given \(p = 0.5\)). Whereas, we compute \(\beta\) as the probability that the number of heads is \(<\) than our threshold when the tosses come from a binomial distribution with probability of success equal to \(q\) .
To summarize: \[ \begin{cases} \alpha = \text{P( # Heads| Honest)} \geq 0.7 \\ \beta = \text{P(# Heads| Liars) < 0.7} \\ \end{cases} \]
We first fix \(q=0.75\) and see how these probabilities change as we increase \(N\).
With a low number tosses the probability of \(\beta\) is high (around \(0.25\)), while \(\alpha\) is significantly lower (around \(0.05\)). As \(N\) increases, both values decrease.
Indeed, when \(N\) is low, if I count many heads in the tosses of a given student I am already able to (often) correctly classify the student as a liar since the threshold we imposed is quite high. However, it also can happen that an honest students gets an high number of heads thus causing me to classify that student as a liar. This is more probable when the number of tosses is lower since our threshold is a function of \(N\). As \(N\) grow, this mistake gets smaller and smaller.
On the other hand, since the threshold is quite high and close to \(q\) , it will be difficult in the beginning to classify a student as liar. Also this mistake will decrease as we require an higher number of tosses.
Of course we expect higher initial probability of errors with a lower \(q\) (closer to \(p\)), since it will be difficult to distinguish the two populations. As \(q\) increases, both probabilities should go down.
In the plot where we fix \(N=50\) and we increase \(q\) by \(0.02\) from \(0.53\) up until \(1\). The threshold is the same as before.
We can see that, since the number of tosses and the threshold are quite high, the probability of False Positives is zero even when \(q\) is low. Conversely, the probability of False Negatives is very high for low values of \(q\) as, according to our prudent decision rule, no student will be classified as liar. As the probability of heads for the liars increases, instead, we will be able to spot them more effectively.
Finally, we would like to add that also \(k\) influences our estimates of FP and FN. We chose \(k\) to make the two distributions balanced. Indeed, doing so we can more easily carry on our probabilistic analysis. Furthermore, if the distributions were unbalanced, then it might be easier to find and by applying decision rule that favors one of the two classes.
1.1 - Score function
We now need to design a score function \(f(N | \alpha^*, \beta^*)\) that, for every value of \(N\) and two selected target error level \(\alpha^*\) and \(\beta^*\) that we want to achieve, outputs a numerical evaluation of the strategy. Our strategy should, indeed, optimize this score function.
Restating the objective: we want \(\alpha\) and \(\beta\) to get as close as possible to \(\alpha^*\) and \(\beta^*\), while keeping the number of tosses reasonable.
\[ \alpha -\alpha^* \rightarrow 0 \\ \beta -\beta^* \rightarrow 0 \\ \]
Of course we don’t care if the deviations from the target values are positive or negative so we consider the absolute value of the differences. Thus, we are interested in minimizing these differences. As stated above, we also want to introduce \(N\) in our function. In particular, we want to penalize values of \(N\) that are too big if there isn’t a sensible improvement in how close we got to the target values leveraging those extra tosses. However, we do not want the function to be dominated by \(N\) as, when we try to minimize it, the main way to do it would be reduce \(N\). So we take the logarithm of this value. Another reason to introduce \(N\) in the equation is that the score function would otherwise be monotonically decreasing (up to a certain value of \(N\)) if we require low target values.
Our score function explicitly is:
\[ f(N|\alpha^*,\beta^*) = (|\alpha - \alpha^*| + |\beta - \beta^*|)\cdot \log N \]
(We notice that this function resembles the Absolute Loss)
Literature:
The first part of the score is actually the Manhattan distance. The initial component of the score reflects the comparison between two crucial quantities of interest: alpha and beta. Rather than directly measuring the distance between coordinates, we assess the differences or errors associated with these quantities. This approach allows us to capture and evaluate the discrepancies in alpha and beta, offering a nuanced understanding of their relationship without delving into the technicalities of distance calculations.
After having defined our score function we will try to evaluate different strategies, i.e. we will compute for different values of \(N\) and \(thresholds\) the score function, and then output the values that minimize it. We start by fixing \(q=0.6\) as with a \(q\) too big the problem would become quite trivial (indeed, not many tosses and an high threshold would be enough to get small values of \(\alpha\) and \(\beta\)). We also assume that we want to obtain low probabilities of False Positives and False Negatives, which is reasonable. In particular we are more concerned with \(\alpha\), thus we are going to require a smaller \(\alpha^*\), since it is worse to accuse a student of being a liar even when he is honest. In particular, we decide that we want a probability of False Positives of \(5\%\) and a probability of False Negatives of \(15\%\).
# Define the score function
score_function <- function(a, b, alpha, beta, N) {
(abs(a - alpha) + abs(b - beta))*log(N)
}
# Define Parameters
q <- 0.6
a_target <- 0.05
b_target <- 0.15## Optimal N: 153
## Optimal Threshold: 86
## Lowest Score: 0.01743232
Indeed, what we see is that, as we increase \(N\), \(\alpha\) and \(\beta\) become smaller and smaller, thus initially the score gets smaller as we converge towards the values of \(\alpha^*\) and \(\beta^*\). Nevertheless, after \(N=153\) and a threshold of \(86\) heads, the score begins to rise again either because the probabilities are too small and therefore different from our target values or because the bigger values of \(N\) are not justified by the improvement in “performance”. Indeed, if we try to compute the probability of FP and FN with these parameters we get
## Alpha: 0.05279594
## Beta: 0.1493306
The probability of false positives is very close to our target \(0.05\). The probability of false negatives is very close to our target \(0.15\) too.
Next, we try to use a score function more similar to the Squared Loss (instead of the Absolute Loss), to see if we can notice big differences in its results, i.e. the score function will be
\[ f(N|\alpha^*,\beta^*) = ((\alpha - \alpha^*)^2 + (\beta - \beta^*)^2)\cdot \log N \]
## Optimal N: 153
## Optimal Threshold: 86
## Lowest Score: 4.157862e-05
We see that we get exactly the same results, thus we are going to keep using the absolute difference. One further modification we could do to the score function is to scale down \(N\) in a different way. For example we could use the square root, thus obtaining a score function of this type
\[ f(N|\alpha^*,\beta^*) = (|\alpha - \alpha^*| + |\beta - \beta^*|)\cdot \sqrt N \]
## Optimal N: 153
## Optimal Threshold: 86
## Lowest Score: 0.04286423
Again we notice that with this \(q\) , \(\alpha^*\) and \(\beta^*\) there doesn’t seem to be much difference.
Finally, a last tweak to the score function could be a different weighting of the two differences. Specifically, since we are more concerned with the probability of false positives we could give more weight to the first difference of our score function, obtaining something like: \[ f(N|\alpha^*,\beta^*) = (10 \cdot |\alpha - \alpha^*| + |\beta - \beta^*|)\cdot \log N \]
## Optimal N: 162
## Optimal Threshold: 91
## Lowest Score: 0.07756323
We notice that putting more weight on \(\alpha\) leads to an higher threshold and an higher \(N\). If we now compute again these probabilities we get:
## Alpha: 0.04933044
## Beta: 0.14145
Indeed, \(\alpha\) got closer to \(\alpha^*\) (however \(\beta\) is now more distant from \(\beta^*\)).
Now we want to see how the optimal \(N\) and threshold vary as \(q, \alpha^*, \beta^*\) change. To this end we are going to use our first score function.
## Q Alpha_Target Beta_Target Optimal_N Optimal_Threshold Lowest_Score
## 1 0.51 0.01 0.01 2 1 0.3258485
## 2 0.51 0.01 0.03 2 1 0.3119855
## 3 0.51 0.01 0.05 2 1 0.2981226
## 4 0.51 0.01 0.07 2 1 0.2842597
## 5 0.51 0.01 0.09 2 1 0.2703967
## 6 0.51 0.01 0.11 2 1 0.2565338
In this plot we can see the best value of \(N\) for increasing values of \(q\) and different \(\alpha^*\) and \(\beta^*\). We notice that, as we would expect, higher \(N\) are required for smaller target values when \(q\) is small. As the probability of heads of the liars increases the number of tosses required decreases for every possible target probability.
In the next plot we isolate the cases of \(\alpha^*=\) 0.05 and 0.17, and \(\beta^*=\) 0.05 and 0.19 and we plot them to better understand what happens.
We can see that as \(q\) increases the number of tosses required drastically decreases. Instead, when \(q\) is very to close to \(p\), the additional cost required by making more tosses doesn’t compensate for the improvement in how close we get to the target values, thus small values of \(N\) are preferred. Moreover, we can notice that if we aim for low \(\alpha^*\) more tosses are generally required, while if we aim for low \(\beta^*\) less tosses are preferred. From the bottom right plot we can also notice that if higher target values are allowed then we will be able to reach them even when \(q\) is quite close to \(p\) (at the cost of tossing the coin many times).
1.2 - Time Constraint
Now, beside the α − β targets, we assume we’re also working under a strict time constraint: knowing that each flip costs \(3\) seconds, we want to check at least half the class in no more than \(T\) minutes.
To handle this new scenario we tweak the original score function to get a new one \(g(N | \alpha, \beta, T)\). Restating our objective: now we want that at least half the class, i.e. \(\frac{n}{2}\) students to toss a coin \(N\) times, with each toss requiring \(3\) seconds, for a total of, at least \(\frac{n}{2} \cdot N \cdot 3\) seconds. We want this quantity to be \(\leq\) \(T \cdot 60\) seconds. So we have to introduce a penalization if the difference between these two times is high. In particular, for simplicity we are going to take the absolute value of the difference since it is, as we also have seen in the results above, very difficult to get low \(\alpha\) and \(\beta\) when \(q\) is close to \(p\) and \(N\) is low. Again we need to scale down this difference otherwise it will dominate completely the function.
Thus we have the following score function: \[ g(N| \alpha^*, \beta^*, T) = \log(N) \cdot (|\alpha - \alpha^*| + |\beta - \beta^*|) + \sqrt{\left(|\frac{N \cdot n}{40} - T|\right)} \]
Again we will try to minimize this function. We fix \(q=0.6\), \(\alpha^*=0.05\) and \(\beta^*=0.15\)
# Define the score function
score_function2 <- function(a, b, T, alpha, beta, N, n) (abs(a - alpha) + abs(b - beta))*log(N) + sqrt(abs((3 * N * (n / 2)) / 60 - T) + 0.000001)
# Set other variables
T <- 15
n <- 150
a_target <- 0.05
b_target <- 0.15
q <- 0.6
p <- 0.5## Optimal N: 4
## Optimal Threshold: 2
## Lowest Score: 0.4053821
The graph shows the scores obtained by varying \(N\) and the threshold. We see that, as we are requiring a target \(\alpha\) smaller than \(\beta\) the threshold will remain quite high. Moreover, we can notice that the function penalizes, as we desire, heavily increasing values of \(N\) when \(T\) is not as big. Indeed we want the time required to carry on our testing to remain smaller than \(T\).
In particular, for \(T=15\) and \(q=0.6\) we notice that the optimal \(N\) is equal to \(4\) and the optimal threshold to \(2\). We can notice that \(3 \cdot 4 \cdot 75\) (i.e. \(3 \cdot N \cdot n/2\)) is equal to \(15 \cdot 60\) (i.e. \(T \cdot 60\)). This means we are finding the values of tosses and threshold that give us the probability of false positives and false negatives closer to the target ones respecting the time constraint imposed by \(T\). Even if, with these parameters, we obtain
## Alpha: 0.3125
## Beta: 0.1792
The rigid time constraint doesn’t allow us to improve further. However, by relaxing \(T\) or increasing \(q\), we notice that it is easier to get close to our targets. Specifically, if we want to keep \(T\) fixed, we should make sure \(q\) is around at least \(0.7\), since we would get \(N=4\) and \(threshold=3\), which would give us \(\alpha=0.0625\) and \(\beta=0.34\) which are much closer to our targets.
Instead, if we have a low value of \(q\) as before, we should allocate more time to our testing procedure, as with \(T=30\) we would get \(N=8\) and \(threshold = 4\), which wouldn’t help us much in reducing the probability of False Positives (\(0.36\)), but at least would bring the probability of False Negatives to \(0.17\), again much closer to our desired target.
Since we have this rigid time constraint and we saw we cannot achieve low target probabilities, a good idea could be to choose to prioritize one of the two. For this reason we use a weighted score function, as we already did in exercise 1.1, that gives more importance to a more precise \(\alpha\). In particular, we are going to increase its weight ten fold as we want to heavily penalize big differences.
## Optimal N: 4
## Optimal Threshold: 3
## Lowest Score: 0.3690266
Indeed, we can notice that now, keeping \(q=0.6\), the threshold is higher (it went from \(2\) to \(3\)), since now we are focusing much more on False Positives and an higher threshold minimizes this kind of error (while increasing the False Negatives). Let’s compute \(\alpha\) and \(\beta\):
## Alpha: 0.0625
## Beta: 0.5248
We see that \(\alpha\) is very close to our target, while \(\beta\) is quite far.
Exercise 2
Setup
In this task, our objective is to delve deeper into the concept of Kernel Density Estimation (KDE). It is crucial to emphasize the significance of parameter selection to optimize the fit with the real population.
The bandwidth determines the width of the kernel, influencing the smoothness of the estimated density. A smaller bandwidth results in a more sensitive estimation that captures finer details in the data. On the other hand, a larger bandwidth produces a smoother estimate, potentially overlooking local variations in the distribution. So, selecting an appropriate bandwidth is crucial for achieving a balance between capturing the underlying distribution’s features and avoiding overfitting or underfitting.
Specifically, we consider the true population distribution denoted as \(F_X(\cdot)\), where the random variable \(X\) is assumed to follow a Beta distribution with user-chosen parameters \(\alpha\) and \(\beta\). For the purposes of this exercise, we employ a Kernel Density Estimator \(f_{bh}(\cdot)\) with a tunable bandwidth parameter \(h > 0\) as the sole parameter, and in this case, we opt for a boxcar/rectangular kernel.
The primary steps involve implementing the Kernel Density Estimator with the selected boxcar kernel and bandwidth \(h\), computing its quantile function, and then comparing it with the true quantile function. To quantify the discrepancy between the two quantile functions, we use the Wasserstein distance. Subsequently, an appropriate \(h\) is chosen based on the assurance that this bandwidth parameter ensures the Wasserstein distance is below a specified threshold \(\epsilon\).
Boxcar Kernel
Statistically speaking, a boxcar function or boxcar kernel can be related to a rectangular window in the context of smoothing or filtering operations. Boxcar kernel is often used in time-series analysis or signal processing. The boxcar function acts as a weight that gives equal importance to all data points within a specified interval.
As by the definition:
\[ \displaystyle \widehat{f}_h(x) = \dfrac{1}{nh}\sum_{i=1}^n K\left(\dfrac{x_i-x}{h}\right) \]
where
\[ K = \frac{1}{2} \]
is the Boxcar kernel.
And each value \(x_i\) gets smoothed if \[-1<\dfrac{x_i-x}{h}<1 \qquad\qquad \Rightarrow \qquad\qquad x_i-h<x<x_i+h \]
as the boxcar transformation is applied for \(x \in [-1,1]\).
In general we can’t integrate smoothly because we don’t know where \(x_i\) are and the set of \(x_i\) depends on too many parameters and we would require a decision tree conversely to just the two parameters required for example for a Beta distribution. However, always in the general case, we can integrate considering the integral for intervals or leveraging cumulative functions, techinque which we will also apply to our problem later on for computing the CDF for computational efficiency.
In the case of a Beta distribution we note that extreme values of \(x_i\) are smoothed by incorporating values outside of the support of the beta distribution, leading to potential issues.
2.1 - Implementation of Boxcar KDE
We implement the boxcar kernel as by definition. We then use a cumulative function to derive the CDF leveraging the notion of integration by summation since it is considerably faster than the function *integrate()* and gives good approximations when we have enough \(x\). Moreover, even the integrate() function discretizes the space (we are aware of the fact that the discretization performed by the function is more precise on average, but the time required to run it is also notably higher).
boxcar_kde <- function(x, data, h=0.5) {
n <- length(x)
t1 <- 1/(n*h)
t2 <- sapply(x, function(xi) {
sum(0.5 * (abs((xi - data) / h) <= 1))
})
return(t1*t2)
}
boxcar_cdf <- function(x, data, h = 0.5) {
pdf_values <- boxcar_kde(x, data, h)
cdf_values <- cumsum(pdf_values)/sum(pdf_values)
return(cdf_values)
}We proceed and fulfill the task of implementing the quantile function of our kernel.
The quantile function and its implementation in the boxcar kernel context offer a mean to understand the distribution in terms of percentiles.
boxcar_quantile <- function(cdf_values, x_values, probs) {
sapply(probs, function(p) {
idx <- which.min(abs(cdf_values - p))
return(x_values[idx])
})
}This function estimates quantiles by finding the \(x\) values associated with probabilities using the closest available CDF values. This method gets less precise when the CDF values are not dense.
In the next step, we will introduce some matemathical tools to estimate the difference in terms of quantiles, so differences between quantile functions of the beta and of the kernel respectively.
We especially focus on the Wasserstein distance (a loss metric) and, as anticipated, we expect favorable performance when the two quantile functions closely resemble each other. By summing the distances and taking the absolute value, we can see how effectively the boxcar kernel’s CDF aligns with the true CDF of the underlying distribution.
Indeed the Wasserstein function measures the dissimilarity in their shapes. Our objective is to minimize this Wasserstein distance. Additionally, the boxcar_quantile function facilitates the extraction of quantiles from the boxcar kernel’s CDF, providing a simple evaluation of the model’s representation of various percentiles within the distribution.
2.2 - First application
For the first case, we have chosen \(\alpha=2\) and \(\beta=2\). These parameter values define the shape of the beta distribution. We proceed by calculating and plotting the distributions. We also fix a bandwidth for the estimator.
# We set the parameter and the seed first
set.seed(13)
a <- 2
b <- 2
NN <- 500
# We name data1, we'll use it later on.
data1 <- rbeta(NN, a, b)
x_vals <- seq(0, 1, length.out = length(data1))
band <- 0.1In this graph we can compare how well the Boxcar_kde aligns with the true Beta distribution taking into account the given data, which is sampled from a Beta distribution with the parameters specified above. The cyan line reflects the approximation of the underlying distribution provided by the boxcar_kde. The bandwidth parameter (\(0.1\)) influences the width of the kernel, affecting the smoothness of the estimated density. A smaller bandwidth results in a more sensitive estimation, whereas a larger bandwidth produces a smoother estimate.
To show this we make another plot with a larger \(h\).
We notice that the KDE is smoother but also less precise. Indeed, with a low bandwith we’re more influenced by the data and get a better fit (however we risk to overfit).
In our scenario we have a Beta distribution with chosen \(\alpha\) and \(\beta\) and we want to compute the Wasserstein distance with the quantile transformation. We underline that setting a bandwidth for the kernel directly impacts our approximation of the data and, eventually our loss.
wasser <- function(z){
cdf <- boxcar_cdf(x_vals, data1, band)
return(abs(qbeta(z, a, b) - boxcar_quantile(cdf, x_vals, z)))
}We now show the CDFs and Quantile function comparing the true distribution and the approximated kernel density estimator function.
The estimated CDF and quantile function seem to be close to the true functions. This is also thanks to the small bandwith we chose.
We now define a new function which aims to identify the \(h\) that satisfies the following condition: \[ \underset{h \in [0,1]}{max} | \qquad W_{L_1,1} (f, \widehat(f))\leq \epsilon \] where \[W_{L_1,1} (f, \widehat(f)) = \int_0^1|F^{-1}(z)-\widehat{F}_h^{-1}(z)|dz\] where \(F^{-1}(z)\) and $ _h^{-1}(z)$ are respectively the quantile functions of the beta distribution and it’s estimation.
The function find_bandwidth computes the Wasserstein
distance for a sequence of bandwidth values, keeps only those satisfying
a specific inequality, and then selects the maximum value. This ensures
the coarsest approximation for the bandwidth, guaranteeing an
approximation lower than or equal to epsilon. It also returns a plot of
all the bandwiths computed and the threshold. Then prints the \(h\) selected that respects our
condition.
The navy curve represents how the Wasserstein distance changes as the bandwidth of the boxcar_kde varies. The y-axis represents the different bandwidth values, and the x-axis represents the corresponding Wasserstein distances. The green line represent the threshold \(\epsilon\) we seek.
We see that as \(h\) increases also the Wasserstein distance increases as we get less precise estimates. Choosing always the coarsest approximation for the bandwidth allows us to avoid picking too small values of \(h\) that would fit the data too well (overfit) and not be able to generalize and understand the true shape of the distribution.
Now we will change the parameters \(\alpha\) and \(\beta\) and then repeat all the steps above.
The graph compares the density distribution of data2 (histogram) with the Boxcar Kernel Density Estimate (green) and the true Beta Density (cyan). It describes how well the Boxcar Kernel Density Estimate aligns with the true density of the data. Even though the data is less regular than before, the boxcar KDE function is still able to capture quite well the true distribution.
Next, we compute again the CDF and quantile function.
We see that our estimated CDF is still able to capture the shape of the distribution.
Finally, we compute again the Wasserstein distances.
We can notice than in both case the maximum distance is up to around \(0.25\) in these two cases. It means that there’s not an \(h\) so bad that the approximation is extremely distant. We can also notice tough that the range varies a lot depending on the parameters of the Beta. Especially when the Beta distribution has \(\alpha=\beta\) and \(h\) is high, it performes poorly.